{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其实在做特征跟调参的期间，我就一直想看看数据中是否有重复数据，因为在做rank_encoding的时候就发现数据中有很大的漏洞，只不过在做特征的前向选择与后向选择实在太花时间了(主要原因还是代码能力弱，没法自动化)，一直到做完特征选择之后才去认真看了下原数据。\n",
    "\n",
    "首先是一部分特征存在等级划分，如'Region'>'BusLoc'>'Neighborhood'，这是地理上的等级；然后是'TolHeight'>'Height'>'RoomDir'，这是每套房屋的等级；最后是房屋内部的等级，'Bedroom'>'Livingroom'>'Bathroom'>'RoomArea'。当然这个等级的次序不同的人有不同的理解，以上次序只是我个人的理解。划分出这些等级的目的其实就是想精准定位出'房屋ID'这个属性，然后就可以找出测试集跟训练集是否有重复数据，对于同一个出租屋，直接用它的历史租金来填充它4月份的租金即可，那么这部分数据就不需要使用模型来预测了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "train_data=pd.read_csv('data/train.csv').fillna(-999)\n",
    "test_data=pd.read_csv('data/test.csv').fillna(-999)\n",
    "\n",
    "# 为了便与合并，将除目标值的所有列字符串化\n",
    "def objectal(df):\n",
    "    for col in df.columns:\n",
    "        if col!='Rental':\n",
    "            df[col]   = df[col].astype(str)\n",
    "    return df\n",
    "train_data=objectal(train_data)\n",
    "test_data=objectal(test_data)\n",
    "\n",
    "mon1_train_df=train_data[train_data.loc[:,'Time']=='1'].drop(['Time','RentRoom'],axis=1).drop_duplicates()\n",
    "mon2_train_df=train_data[train_data.loc[:,'Time']=='2'].drop(['Time','RentRoom'],axis=1).drop_duplicates()\n",
    "mon3_train_df=train_data[train_data.loc[:,'Time']=='3'].drop(['Time','RentRoom'],axis=1).drop_duplicates()\n",
    "\n",
    "common_cols=list(mon1_train_df.columns)\n",
    "common_cols.remove('Rental')\n",
    "\n",
    "# 按月计算出房屋的均租金\n",
    "mon1_train_df=mon1_train_df.groupby(common_cols,as_index=False).mean()\n",
    "mon2_train_df=mon2_train_df.groupby(common_cols,as_index=False).mean()\n",
    "mon3_train_df=mon3_train_df.groupby(common_cols,as_index=False).mean()\n",
    "\n",
    "# 二月并一月，缺失值由一月数据来填充\n",
    "recent_mean_rental=mon2_train_df.merge(mon1_train_df,how='outer',on=common_cols).fillna(method='bfill',axis=1)\n",
    "recent_mean_rental=recent_mean_rental.drop(['Rental_y'],axis=1).rename(columns={'Rental_x':'Rental'})\n",
    "# 三月并二月，缺失值由二月(一月)来填充\n",
    "recent_mean_rental=mon3_train_df.merge(recent_mean_rental,how='outer',on=common_cols).fillna(method='bfill',axis=1)\n",
    "recent_mean_rental=recent_mean_rental.drop(['Rental_y'],axis=1).rename(columns={'Rental_x':'Rental'})\n",
    "\n",
    "statistic_pred=test_data.merge(recent_mean_rental,how='left',on=common_cols)\n",
    "statistic_pred.loc[:,['id','Rental']].to_csv('./result/statistic_pred.csv',index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将这个csv文件覆盖掉队友的LGB预测文件之后，线上分数从1.86升至1.82；\n",
    "\n",
    "先将XGB与LGB均化融合，在用这个csv覆盖掉融合结果，线上分数从1.82升至1.80。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
